39 research outputs found

    Word Embeddings Based on Large Web Corpora as a Powerful Lexicographic Tool

    The Aranea Project offers a set of comparable corpora for two dozen (mostly European) languages, providing a convenient dataset for NLP applications that require training on large amounts of data. The article presents word embedding models trained on the Aranea corpora and an online interface for querying the models and visualizing the results. The implementation is aimed at lexicographic use but can also be useful in other fields of linguistic study, since the vector space is a plausible model of the semantic space of word meanings. Three different models are available: one for a combination of part of speech and lemma, one for raw word forms, and one based on the fastText algorithm, which uses subword vectors and is not limited to whole or known words in finding their semantic relations. The article describes the interface and the major modes of its functionality; it does not attempt a detailed linguistic analysis of the presented examples.
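
Querying an embedding model for related words usually comes down to ranking vectors by cosine similarity. The following is a minimal sketch of that idea, not the interface described in the article: the three-dimensional vectors and the names `cosine` and `most_similar` are illustrative assumptions, and a real model would be loaded from trained data.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(
    word: str, vectors: Dict[str, List[float]], topn: int = 3
) -> List[Tuple[str, float]]:
    """Rank all other words in `vectors` by cosine similarity to `word`."""
    target = vectors[word]
    scored = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Toy vectors invented for illustration; they come from no trained model.
toy_vectors = {
    "pes": [0.9, 0.1, 0.0],    # "dog"
    "mačka": [0.8, 0.2, 0.1],  # "cat"
    "auto": [0.0, 0.1, 0.9],   # "car"
}

for neighbour, score in most_similar("pes", toy_vectors):
    print(f"{neighbour}\t{score:.3f}")
```

With these toy values "mačka" ranks above "auto" as a neighbour of "pes", because its vector points in nearly the same direction.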

    Bilingual Corpus - Digital Repository for Preservation of Language Heritage

    The article briefly reviews a bilingual Slovak-Bulgarian/Bulgarian-Slovak parallel and aligned corpus. The corpus was collected and developed as a result of collaboration within the framework of a joint research project between the Institute of Mathematics and Informatics, Bulgarian Academy of Sciences, and the Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences. Multilingual corpora are large repositories of language data with an important role in preserving and supporting the world's cultural heritage, because natural language is an outstanding part of human cultural values and collective memory, and a bridge between cultures. This bilingual corpus will be widely applicable to contrastive studies of the two Slavic languages and will also be a useful resource for language engineering research and development, especially in machine translation.

    Naive Terminological Annotation of Legal Texts in Slovak – Can It Be Useful?

    Correct automatic terminological annotation of texts in a corpus can sometimes be a challenging task, especially for moderately or heavily inflected languages with relatively free word order. We explore the possibility of simple annotation based on sequence matching of lemmatized texts to annotate a Slovak language corpus with IATE terminological entries. The accuracy of annotating legal language is very good for multiword terms, while the accuracy for single-word terms can be increased by applying simple filters based on word length and by blacklisting the most frequent false positives.
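
The approach outlined in the abstract (matching lemma sequences, then filtering single-word hits by length and a blacklist) can be sketched as follows. This is an illustrative assumption of how such a matcher might look, not the authors' implementation: the function name `annotate_terms`, the term identifiers `T1` to `T4`, the length threshold, and the sample lemmas are all hypothetical.

```python
from typing import Dict, FrozenSet, List, Sequence, Tuple


def annotate_terms(
    lemmas: Sequence[str],
    terms: Dict[str, Tuple[str, ...]],
    min_single_len: int = 4,
    blacklist: FrozenSet[str] = frozenset(),
) -> List[Tuple[int, int, str]]:
    """Return (start, end, term_id) spans where a term's lemmas match the text.

    Single-word matches are dropped when the lemma is shorter than
    `min_single_len` or appears in `blacklist`.
    """
    # Index terms by their first lemma so each position checks few candidates.
    by_first: Dict[str, List[Tuple[str, Tuple[str, ...]]]] = {}
    for term_id, term_lemmas in terms.items():
        if term_lemmas:  # skip empty entries
            by_first.setdefault(term_lemmas[0], []).append((term_id, term_lemmas))

    spans: List[Tuple[int, int, str]] = []
    for i, lemma in enumerate(lemmas):
        for term_id, term_lemmas in by_first.get(lemma, []):
            n = len(term_lemmas)
            if tuple(lemmas[i:i + n]) != term_lemmas:
                continue
            if n == 1 and (len(lemma) < min_single_len or lemma in blacklist):
                continue
            spans.append((i, i + n, term_id))
    return spans


# Hypothetical lemmatized sentence and term list (identifiers are made up).
text = ["európsky", "parlament", "prijať", "nariadenie",
        "o", "ochrana", "osobný", "údaj"]
term_list = {
    "T1": ("európsky", "parlament"),
    "T2": ("ochrana", "osobný", "údaj"),
    "T3": ("o",),           # too short, removed by the length filter
    "T4": ("nariadenie",),
}

spans = annotate_terms(text, term_list)
print(spans)
```

Passing `blacklist=frozenset({"nariadenie"})` would also suppress the single-word hit for `T4`, which mirrors the blacklisting of frequent false positives mentioned above.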

    Accuracy of Slovak Language Lemmatization and MSD Tagging – MorphoDiTa and SpaCy

    The Slovak language, as a “typical” Slavic language, belongs to the group of moderately inflected languages, with three or four genders and two grammatical numbers, all interacting with the inflections in somewhat complicated and unpredictable ways. The inflections are realized primarily by suffixes, but with many irregularities; one suffix encodes several relevant grammatical categories, and the same suffix often reflects unrelated features in other words, which makes Slovak a typical inflectional language not amenable to heuristic analysis. Because of these properties, lemmatization is often an indispensable step in all kinds of text processing (starting with full-text search), and full morphosyntactic analysis or description (MSD) is the core of corpus linguistic research. Given the central importance of lemmatization and MSD in Slovak corpus linguistics, it is important to understand their limitations and to recognize the achievable accuracy. Since modern approaches aim to utilize deep learning and huge language models, we evaluate the accuracy of lemmatization + MSD in several common usage scenarios by comparing the state-of-the-art “classical” lemmatizer and MSD tagger MorphoDiTa, based on a perceptron, with spaCy, which uses a multilingual BERT language model.
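
Accuracy for lemmatization + MSD is commonly reported per token, separately for lemmas, for tags, and for both at once. The sketch below shows one such computation under stated assumptions: it is not the evaluation protocol of the article, the function name `tagging_accuracy` is hypothetical, and the single-letter tags are placeholders rather than a real Slovak tagset.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (lemma, MSD tag)


def tagging_accuracy(gold: List[Token], predicted: List[Token]) -> Dict[str, float]:
    """Per-token accuracy for lemmas, MSD tags, and both together.

    Assumes both sequences share the same tokenization.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same number of tokens")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty sequence")

    total = len(gold)
    lemma_ok = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    tag_ok = sum(g[1] == p[1] for g, p in zip(gold, predicted))
    both_ok = sum(g == p for g, p in zip(gold, predicted))
    return {
        "lemma": lemma_ok / total,
        "msd": tag_ok / total,
        "lemma+msd": both_ok / total,
    }


# Made-up gold annotation and system output for a four-token sentence.
gold = [("pes", "N"), ("vidieť", "V"), ("mačka", "N"), (".", "Z")]
predicted = [("pes", "N"), ("videť", "V"), ("mačka", "A"), (".", "Z")]

scores = tagging_accuracy(gold, predicted)
print(scores)
```

Here one token has a wrong lemma and a different token has a wrong tag, so the joint score is lower than either individual score.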

    Translation equivalence of demonstrative pronouns in Bulgarian-Slovak parallel texts

    In this paper we describe our automatic analysis of several parallel Bulgarian-Slovak texts with the goal of obtaining useful information about Slovak translation equivalents of (definite) articles and demonstrative pronouns in Bulgarian. Rather than focusing on individual translation equivalents, we present a method for automatic extraction and visualization of the translations. This can serve as a guide for pinpointing interesting features in specific translated documents and could be extended to other parts of speech or otherwise identifiable textual units.

    Extraction and Presentation of Bilingual Correspondences from Slovak-Bulgarian Parallel Corpus

    This paper describes the results of the automatic extraction and presentation of bilingual correspondences from the Slovak-Bulgarian parallel corpus. The equivalent phrases are extracted from a corpus automatically aligned at the sentence and word level, then filtered, indexed and presented in a dictionary-like interface. The bilingual dictionary database contains 80 thousand phrase pairs consisting of approximately 350 thousand words (per language). Counting unique word forms, the size is 31 thousand in the Slovak part of the dictionary and 26 thousand in the Bulgarian part.

    Web presentation of bilingual corpora (Slovak-Bulgarian and Bulgarian-Polish)

    In this paper we focus on the web presentation of bilingual corpora in three Slavic languages and their possible applications. The Slovak-Bulgarian and Bulgarian-Polish corpora were collected and developed as a result of collaboration within the framework of two joint research projects between the Institute of Mathematics and Informatics, Bulgarian Academy of Sciences, on one side, and the Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences, and the Institute of Slavic Studies, Polish Academy of Sciences, on the other, coordinated by the authors of this paper.

    Main results of MONDILEX project

    The paper presents the results and recommendations of MONDILEX, a 7FP project that covered six Slavic languages: Bulgarian, Polish, Russian, Slovak, Slovene, and Ukrainian. The paper summarizes the research undertaken on the standardisation and integration of Slavic language resources and on the establishment of a virtual organisation supporting a research infrastructure for Slavic lexicography. The results should be useful for implementing a research infrastructure in the coming years.